Unconstrained Language Identification Using A Shape Codebook

Authors

  • Guangyu Zhu
  • Xiaodong Yu
  • Yi Li
  • David Doermann
Abstract

We propose a novel approach to language identification in document images containing handwriting and machine-printed text using image descriptors constructed from a codebook of shape features. We encode local text structures using scale- and rotation-invariant codewords, each representing a characteristic shape feature that is generic enough to appear repeatably. We learn a concise, structurally indexed shape codebook from training data by clustering similar features and partitioning the feature space by graph cuts. Our approach is segmentation-free and easily extensible. We quantitatively evaluate our approach using a large real-world document image collection, which consists of more than 1,500 documents in 8 languages (Arabic, Chinese, English, Hindi, Japanese, Korean, Russian, and Thai) and contains a complex mixture of handwritten and machine-printed content. Experimental results demonstrate the robustness and flexibility of our approach, and show exceptional language identification performance that exceeds...

Similar resources

Language identification for handwritten document images using a shape codebook

Language identification for handwritten document images is an open document analysis problem. In this paper, we propose a novel approach to language identification for documents containing a mixture of handwritten and machine-printed text using image descriptors constructed from a codebook of shape features. We encode local text structures using scale- and rotation-invariant codewords, each repres...

Vision-Based Sign Language Recognition Using Sign-Wise Tied Mixture HMM

In this paper, a new sign-wise tied mixture HMM (SWTMHMM) is proposed and applied to vision-based sign language recognition (SLR). In the SWTMHMM, the mixture densities of the same sign model are tied so that the states belonging to the same sign share a common local codebook, which leads to robust estimation of model parameters and efficient computation of probability densities. For the sign feat...

Object Detection Using A Shape Codebook

This paper presents a method for detecting categories of objects in real-world images. Given training images of an object category, our goal is to recognize and localize instances of those objects in a candidate image. The main contribution of this work is a novel structure of the shape codebook for object detection. A shape codebook entry consists of two components: a shape codeword and a grou...

Multigrams for language identification

In this paper we present two new approaches for language identification. Both are based on so-called multigrams, an information-theoretic observation representation. In the first approach we use multigram models for phonotactic modeling of phoneme or codebook sequences. The multigram model can be used to segment the new observation into larger units (e.g. something like ...

Publication date: 2008